Using Bloom Filters to Refine Web Search Results
Abstract
Search engines have primarily focused on presenting the most relevant pages to the user quickly. A less well-explored way to improve the search experience is to remove or group near-duplicate documents in the results presented to the user. In this paper, we apply a Bloom filter-based similarity detection technique to address this issue by refining the search results presented to the user. First, we present and analyze our technique for finding similar documents using content-defined chunking and Bloom filters, and demonstrate its effectiveness in compactly representing and quickly matching pages for similarity testing. Next, we demonstrate that a number of the results of popular and random search queries retrieved from different search engines (Google, Yahoo, and MSN) are similar and can be eliminated or reorganized. Finally, we apply our near-duplicate detection technique to show how to effectively remove similar search results and improve the user experience.
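The abstract does not give implementation details, so the following is only a minimal Python sketch of the general idea it names: split each page into content-defined chunks, insert the chunks into a Bloom filter, and compare two pages by comparing their filters. Every name and parameter here (`WINDOW`, `MASK`, `M_BITS`, `K_HASHES`, the CRC32 window hash, the SHA-256 based bit positions, and the bit-overlap similarity measure) is an assumption chosen for illustration, not the authors' actual design.

```python
import hashlib
import zlib

WINDOW = 16      # bytes hashed at each position (assumed value)
MASK = 0x1F      # boundary when the low 5 hash bits are zero (assumed value)
M_BITS = 4096    # Bloom filter size in bits (assumed value)
K_HASHES = 3     # bit positions set per chunk (assumed value)


def content_defined_chunks(data: bytes) -> list[bytes]:
    """Split data wherever a hash of the trailing WINDOW bytes matches a pattern.

    Boundaries depend only on local content, so an insertion early in the
    page does not shift every later chunk. CRC32 stands in here for a
    rolling fingerprint (an assumption made for brevity).
    """
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def bloom_filter(chunks: list[bytes]) -> int:
    """Return an M_BITS-wide Bloom filter, stored as a Python int."""
    bits = 0
    for chunk in chunks:
        for k in range(K_HASHES):
            # Salt the chunk with k to simulate K_HASHES independent hashes.
            digest = hashlib.sha256(bytes([k]) + chunk).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % M_BITS)
    return bits


def similarity(a: int, b: int) -> float:
    """Fraction of the bits set in filter a that are also set in filter b."""
    set_a = bin(a).count("1")
    return bin(a & b).count("1") / set_a if set_a else 0.0


if __name__ == "__main__":
    page = " ".join(f"word{(i * 7919) % 1000}" for i in range(400)).encode()
    near_dup = page[:1500] + b" inserted phrase " + page[1500:]
    f_page = bloom_filter(content_defined_chunks(page))
    f_dup = bloom_filter(content_defined_chunks(near_dup))
    print(f"similarity: {similarity(f_page, f_dup):.2f}")
```

Storing the filter as a single integer keeps the comparison to one bitwise AND and two bit counts, which is the kind of compact, fast matching the abstract describes. A real system would also need a similarity threshold for deciding what counts as a near-duplicate; none is assumed here.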
Similar resources
Privacy-Enhanced Searches Using Encrypted Bloom Filters
It is often necessary for two or more parties that do not fully trust each other to selectively share data. We propose a search scheme based on Bloom filters and Pohlig-Hellman encryption. A semi-trusted third party can transform one party’s search queries to a form suitable for querying the other party’s database, in such a way that neither the third party nor the database owner can se...
Bloom Cookies: Web Search Personalization without User Tracking
We propose Bloom cookies that encode a user’s profile in a compact and privacy-preserving way, without preventing online services from using it for personalization purposes. The Bloom cookies design is inspired by our analysis of a large set of web search logs that shows drawbacks of two profile obfuscation techniques, namely profile generalization and noise injection, today used by many privac...
P-LUPOSDATE: Using Precomputed Bloom Filters to Speed Up SPARQL Processing in the Cloud
Increasingly data on the Web is stored in the form of Semantic Web data. Because of today’s information overload, it becomes very important to store and query these big datasets in a scalable way and hence in a distributed fashion. Cloud Computing offers such a distributed environment with dynamic reallocation of computing and storing resources based on needs. In this work we introduce a scalab...
Anytime Query Answering in RDF through Evolutionary Algorithms
We present a technique for answering queries over RDF data through an evolutionary search algorithm, using fingerprinting and Bloom filters for rapid approximate evaluation of generated solutions. Our evolutionary approach has several advantages compared to traditional database-style query answering. First, the result quality increases monotonically and converges with each evolution, offering “a...
Scaling Filename Queries in a Large-Scale Distributed File System
We have examined the tradeoffs in applying regular and compressed Bloom filters to the name query problem in distributed file systems, and have developed and tested a novel mechanism for scaling queries as the network grows large. Filters greatly reduced query messages when using Fan’s “Summary Cache” in web cache hierarchies [6], a similar, albeit smaller, searching problem. We have implemented a test...
Publication date: 2005